• DOMAIN: Digital content and entertainment industry

• CONTEXT: The objective of this project is to build a text classification model that analyses customer sentiment based on reviews in the IMDB database. The model uses a deep learning architecture with an embedding layer followed by a classification layer to analyse the sentiment of the customers.

• DATA DESCRIPTION: The dataset consists of 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocabulary size of 10,000. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

• PROJECT OBJECTIVE: Build a sequential NLP classifier which can use input text parameters to determine the customer sentiments.

Steps and tasks: [ Total Score: 30 points]

  1. Import and analyse the data set.

Hint: - Use imdb.load_data() method

  1. Perform relevant sequence padding on the data
  2. Perform following data analysis: • Print shape of features and labels • Print value of any one feature and its label
  3. Decode the feature value to get original sentence
  4. Design, train, tune and test a sequential model. Hint: The aim here is to import the text and process it in such a way that it can be taken as an input to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.
  5. Use the designed model to print the prediction on any one sample.

Import and analyse the data set.

Get the train and test sets, taking the 10,000 most frequent words.

Importing the required libraries and loading the built-in dataset from TensorFlow as mentioned in the hint.

We shall split the dataset provided into 4 parts. Unseen data will be used for the final comparison of prediction and actual value. Train data is used to train the model and the validation dataset is used to validate the model. Test data is used to find the overall accuracy of the model.
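The loading and 4-way split described above can be sketched as follows; the exact split sizes here are illustrative choices, not the notebook's actual ones:

```python
from tensorflow.keras.datasets import imdb

VOCAB_SIZE = 10_000  # keep only the 10,000 most frequent words

# load_data returns reviews already encoded as lists of word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

# carve the test set into validation, a small "unseen" slice for the
# final sanity check, and the remaining test data (sizes are our choice)
x_val, y_val = x_test[:12_500], y_test[:12_500]
x_unseen, y_unseen = x_test[12_500:12_550], y_test[12_500:12_550]
x_test, y_test = x_test[12_550:], y_test[12_550:]

print(len(x_train), len(x_val), len(x_unseen), len(x_test))
print(x_train[0][:10], y_train[0])  # one encoded review and its label
```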

Checking the data type and shape of the dataset

Check random record in the dataset

Get indices for words from the IMDB dataset and print random train data.
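Decoding an encoded review back to text can be sketched as below; the +3 offset reflects the reserved indices used by `imdb.load_data` (0 = padding, 1 = start, 2 = unknown):

```python
from tensorflow.keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data(num_words=10_000)

# word -> index mapping supplied with the dataset; data indices are
# offset by 3 because 0, 1, 2 are reserved tokens
word_to_id = imdb.get_word_index()
id_to_word = {idx + 3: word for word, idx in word_to_id.items()}
id_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})

decoded = " ".join(id_to_word.get(i, "<UNK>") for i in x_train[0])
print(decoded[:200])
```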

Printing the first 5000 words in the word_to_id items.

'the', 'and' and 'a' are the most frequently used words. We are not removing stopwords from the sentences here: these words carry meaning for the sentiment and hence will not be removed before processing.

Perform relevant sequence padding on the data

Adding padding to the data with a maximum length of 300. Here we pre-pad the sequences and truncate sentences that are longer than 300 words.
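The pre-padding and pre-truncating above can be sketched with `pad_sequences`; a small `maxlen` and toy reviews are used here so the effect is visible (the notebook uses `maxlen=300`):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

reviews = [[17, 25, 9], [4, 2, 7, 1, 13]]  # toy encoded reviews

# pre-pad short reviews with 0 and pre-truncate long ones to maxlen
padded = pad_sequences(reviews, maxlen=4, padding="pre", truncating="pre")
print(padded)
```

Note that with `truncating="pre"` the words dropped from the long review are the earliest ones.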

Design, train, tune and test a sequential model.

We build the model with a simple LSTM layer and a dense output layer with sigmoid activation, since this is binary classification, using binary_crossentropy as the loss.

Here we generate an embedding of size 50 for each word. Since the embedding layer itself is trained, we will see a larger number of trainable parameters.

Define dropout and LSTM layers. The LSTM cell and hidden state size is 128.

Using sigmoid as the activation function; this is the output layer.
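Putting the layers described above together, a minimal sketch of this first model (trainable embedding of size 50, 128-unit LSTM, sigmoid output) could look like this; the dropout rate is an illustrative value:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 50, 300

model = Sequential([
    # trainable embedding: one 50-dim vector per word in the vocabulary
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Dropout(0.2),                        # rate chosen for illustration
    LSTM(128),                           # 128-unit cell / hidden state
    Dense(1, activation="sigmoid"),      # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
```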

As we can observe, the number of non-trainable params is 0 and there is a large number of parameters (weights and biases). Most of the parameters are in the embedding layer compared with the LSTM layer.

As observed in the above cell, the overall accuracy is around 85% on the test data. Although the model has high accuracy on the train data, it may be an overfit model, since the test accuracy is well below the train accuracy, and the same holds for the validation accuracy.

Checking model performance using the graphs
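The accuracy curves can be plotted from the `History` object returned by `model.fit`; the dictionary below is a stand-in with illustrative numbers, not the notebook's actual training results:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# stand-in for model.fit(...).history; values are illustrative only
history = {"accuracy": [0.71, 0.85, 0.93, 0.97],
           "val_accuracy": [0.78, 0.83, 0.84, 0.84]}

plt.plot(history["accuracy"], label="train")
plt.plot(history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("accuracy.png")
```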

The above model is overfitting, as there is a large gap between validation and train accuracy and loss, and the gap is widening.

We can instead use a pre-trained embedding like GloVe. GloVe stands for “Global Vectors”. GloVe does not rely just on local statistics (local context information of words), but also incorporates global statistics (word co-occurrence) to obtain word vectors.

Using the gensim library to load the GloVe embedding. The basic idea behind the GloVe word embedding is to derive relationships between words from statistics. Unlike the occurrence matrix, the co-occurrence matrix tells you how often a particular word pair occurs together; each value in the co-occurrence matrix represents a pair of words occurring together.

Comparing the IMDB words with the global embedding and building the embedding matrix.
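The matrix construction can be sketched as below. To keep it self-contained, a toy dictionary stands in for the real GloVe vectors (which the notebook loads with gensim, e.g. `gensim.downloader.load("glove-wiki-gigaword-50")`), and the dimension is reduced to 4:

```python
import numpy as np

# toy stand-in for the GloVe vectors; real code would load them via gensim
glove = {"the": np.array([0.1, 0.2, 0.3, 0.4]),
         "movie": np.array([0.5, 0.1, 0.0, 0.2])}
EMBED_DIM = 4

# word -> index for our vocabulary (reserved-index offsets omitted here)
word_index = {"the": 1, "movie": 2, "zzzrare": 3}

# one row per index; rows stay zero for words missing from GloVe
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]

print(embedding_matrix.shape)
```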

Checking random embedding matrix

Building the model using LSTM layer.

Defining LSTM with cell and hidden state length as 128

As we can see in the above table, the entire embedding layer is made non-trainable; in effect we are using transfer learning, and this is what helps improve accuracy.
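Freezing the embedding layer can be sketched like this; a random matrix stands in for the GloVe weight matrix built earlier, and `set_weights` injects it after the model is built:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 50, 300
# stand-in for the GloVe embedding matrix built from the vocabulary
embedding_matrix = np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32")

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, trainable=False),  # frozen weights
    LSTM(128),
    Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, MAX_LEN))
model.layers[0].set_weights([embedding_matrix])  # inject pre-trained vectors
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()  # the 500,000 embedding weights appear as non-trainable
```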

The overall accuracy is around 88%; let's check whether the model is overfitting with the help of graphs.

As we can observe in the above graphs, the train and validation accuracies are close to each other and the difference in loss is also small. This model is not overfitting like the previous one.

Use the designed model to print the prediction on any one sample

We consider only 50 records from the dataset to get the output and compare them. We also define a function for displaying the confusion matrix, which helps us evaluate the model's score.
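A minimal sketch of such a helper using scikit-learn is shown below; the labels and predictions here are made-up stand-ins for a few of the sampled reviews:

```python
from sklearn.metrics import confusion_matrix

# stand-in labels and predictions for 10 sampled reviews
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# with labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```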

Observations from the above graph are mentioned below:

1) There are 15 true negatives (TN): we predicted negative and the review's sentiment is indeed negative.

2) There are 28 true positives (TP): we predicted positive and the review's sentiment is indeed positive.

3) There are 3 false positives (FP): we predicted positive but the review was negative (also known as a "Type I error").

4) There are 6 false negatives (FN): we predicted negative but the review was positive (also known as a "Type II error").

Observations from the above graph are mentioned below:

1) There are 2044 true negatives (TN): we predicted negative and the review's sentiment is indeed negative.

2) There are 2278 true positives (TP): we predicted positive and the review's sentiment is indeed positive.

3) There are 161 false positives (FP): we predicted positive but the review was negative (also known as a "Type I error").

4) There are 526 false negatives (FN): we predicted negative but the review was positive (also known as a "Type II error").

Checking a few examples of movie reviews.

Only a few misclassifications are observed when comparing the results. Overall the model performance is good and it provides good results.

Summary

The dataset provided is built into TensorFlow and has 50,000 records, which we split into test, validation and unseen data for testing and validation purposes. We implemented 2 models: for model #1 we trained our own embedding, and for model #2 we used the GloVe embedding.

The model built using GloVe had higher accuracy and better overall performance. It has far fewer parameters to train, which helped avoid overfitting. In general, it is advisable to use transfer learning wherever possible to increase the accuracy and reliability of the model. Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. It is a popular approach in deep learning, where pre-trained models are used as the starting point for natural language processing tasks, given the vast compute and time resources required to develop neural network models for these problems and the large jumps in skill they provide on related problems. Compared to word2vec, GloVe allows for a parallel implementation, which makes it easier to train over more data. GloVe is believed to combine the benefits of the word2vec skip-gram model on word analogy tasks with those of matrix factorization methods that exploit global statistical information.

Comparing against the test data, some observations from the graphs: there are many true negatives and true positives, so the model does well at predicting both negative and positive viewer sentiment. We also observe 179 false positives and 449 false negatives ("Type I" and "Type II" errors). To mitigate these errors we could move to other models such as a bidirectional LSTM, which takes input in both the forward and backward directions and thus helps understand the reviews better.

In this project, we used a pre-trained embedding, GloVe. GloVe stands for “Global Vectors”: it captures both the global and local statistics of a corpus in order to come up with word vectors. GloVe does not rely just on local statistics (local context information of words), but also incorporates global statistics (word co-occurrence) to obtain word vectors. This embedding helped considerably in classifying the movie reviews correctly and improved the accuracy of the model.

DOMAIN: Social media analytics

CONTEXT: Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. In this hands-on project, the goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.

DATA DESCRIPTION: The dataset is collected from two news websites, theonion.com and huffingtonpost.com. This new dataset has the following advantages over the existing Twitter datasets: since news headlines are written by professionals in a formal manner, there are no spelling mistakes and informal usage. This reduces the sparsity and also increases the chance of finding pre-trained embeddings. Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise as compared to Twitter datasets. Unlike tweets that reply to other tweets, the news headlines obtained are self-contained. This would help us in teasing apart the real sarcastic elements.

Content: Each record consists of three attributes:

• is_sarcastic: 1 if the record is sarcastic, otherwise 0

• headline: the headline of the news article

• article_link: link to the original news article; useful in collecting supplementary data

Reference: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

PROJECT OBJECTIVE: Build a sequential NLP classifier which can use input text parameters to determine the customer sentiments. Steps and tasks: [ Total Score: 30 points]

  1. Read and explore the data
  2. Retain relevant columns
  3. Get length of each sentence
  4. Define parameters
  5. Get indices for words
  6. Create features and labels
  7. Get vocabulary size
  8. Create a weight matrix using GloVe embeddings
  9. Define and compile a Bidirectional LSTM model. Hint: Be analytical and experimental here in trying new approaches to design the best model.
  10. Fit the model and check the validation accuracy

Read and explore the data

Applying some basic analysis on the dataframe.

The dataset has an almost equal number of sarcastic and normal headlines, although it is not an exactly balanced dataset.

Let's check the frequent words using the WordCloud library.

WordCloud is a technique to show which words are the most frequent among the given text.

Word Clouds (also known as wordle, word collage or tag cloud) are visual representations of words that give greater prominence to words that appear more frequently.

As observed, the above image shows the frequent words used among the given text.

Get the length of each sentence by adding a new column to the dataset.

Plotting the number of rows against the sentence length, with bins.

Checking max length available in the headlines column
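Adding the length column and inspecting its distribution can be sketched as below; the two headlines are stand-ins for the real dataframe rows:

```python
import pandas as pd

df = pd.DataFrame({"headline": ["boehner just wants wife to listen",
                                "the 'roseanne' revival catches up"]})

# word count per headline; a histogram of this column shows the
# length distribution, and max() gives the longest headline
df["length"] = df["headline"].str.split().str.len()
print(df["length"].max())
# df["length"].plot.hist(bins=30)
```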

Retain relevant columns

Checking one sample record

Define parameters

Here we use a vocabulary size of 30,000, which covers almost all the words in the given dataset. We define the embedding dimension as 50 and the maximum length as 100 for each headline.

Creating required train test split on the prepared data set (Create features and labels).

In the project (module) requirement PDF, it is mentioned that we should get the indices first and then create features and labels.

The tasks are mentioned as below: Get indices for words; Create features and labels.

If we create the indices first, then even the test data will be seen during tokenisation, whereas in reality there is a fair chance that an unseen word will appear in a document.

So here we shall check the performance of the model on both approaches:

1) Applying tokenizer on whole dataset and build the model

2) Applying tokenizer after splitting the dataset into train and test.

Applying the tokenizer on the whole dataset.

As displayed in the above cell, words are indexed based on their occurrence, and these indices will be used to generate the embeddings for the words.
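The tokenisation step can be sketched as below with the Keras `Tokenizer` (available in tf.keras up to Keras 2; Keras 3 replaces it with `TextVectorization`). The headlines and the `oov_token` choice are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

headlines = ["former store clerk sues store", "clerk wins case"]  # stand-ins

tok = Tokenizer(num_words=30_000, oov_token="<OOV>")
tok.fit_on_texts(headlines)

# index 1 is the OOV token, then words ranked by frequency
print(tok.word_index)
print(tok.texts_to_sequences(headlines))
```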

Get vocabulary size

Applying tokenizer after splitting the dataset into train and test.

Get indices for words

Checking length and one sample record

Checking basic statistic about the tokenizer

Create a weight matrix using GloVe embeddings

Using the gensim library to load the GloVe embedding. The basic idea behind the GloVe word embedding is to derive relationships between words from statistics. Unlike the occurrence matrix, the co-occurrence matrix tells you how often a particular word pair occurs together; each value in the co-occurrence matrix represents a pair of words occurring together.

Define and compile a Bidirectional LSTM model.

( Fit the model and check the validation accuracy)

Model #1: multiple bidirectional layers, using data from separate tokenizers (a tokenizer applied to X_train and X_test separately).

Train and validation accuracy are close to each other, with little divergence observed in the graph. This model looks like a good fit, with very little overfitting.

Train and validation loss are likewise close to each other, with little divergence observed in the graph.

To build this model we use multiple bidirectional LSTM layers, each with a different hidden and cell state size, along with a few dropout layers. One thing to note is the dropout layer placed right before the sigmoid output; it helped reduce overfitting of the model. This is a bit unconventional, since a dropout layer is seldom used just before the sigmoid output layer, but in this case it helped.

We use the GloVe embedding for the weights, which greatly increased the accuracy of the model and also reduced its training time.
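A minimal sketch of model #1's shape, stacking bidirectional LSTMs with dropout before the sigmoid output, is shown below; the layer sizes after the first are illustrative, and in the notebook the embedding weights would be set from the GloVe matrix:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     Dropout, Dense)

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 30_000, 50, 100

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),          # GloVe weights go here
    Bidirectional(LSTM(128, return_sequences=True)),  # forward + backward
    Bidirectional(LSTM(64)),                   # second, smaller BiLSTM
    Dropout(0.3),                              # dropout just before output
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
```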

Model #2: a single bidirectional layer, using data from separate tokenizers (a tokenizer applied to X_train and X_test separately).

It looks like this model and the earlier one behave similarly.

Model #3: multiple bidirectional layers, using data from a tokenizer fitted on the whole dataset.

In bi-directional LSTM model,

1) data from one LSTM gets fed in forward direction

2) While for the other LSTM, data is fed in reverse direction.

The arrangement will aid for better learning of sentences.

The outputs of the two LSTMs are combined; different merge modes are available (sum, concat, mul).

Since the surrounding words in a sentence can change its meaning, using a bidirectional model is recommended.

There is a huge gap in the loss, and this model is a bit less accurate than the other models, hence it cannot be used for prediction.

Model #4: a single bidirectional layer, using data from a tokenizer fitted on the whole dataset.

There is a huge gap in the accuracy and loss graphs; this model looks like an overfit model and hence cannot be used for prediction.

Observations from the above graph are mentioned below:

1) There are 2658 true negatives (TN): we predicted not sarcastic and the headline is indeed not sarcastic.

2) There are 2257 true positives (TP): we predicted sarcastic and the headline is indeed sarcastic.

3) There are 501 false positives (FP): we predicted sarcastic but the headline was not sarcastic (also known as a "Type I error").

4) There are 282 false negatives (FN): we predicted not sarcastic but the headline was sarcastic (also known as a "Type II error").

Summary

The objective of this project is to determine whether a headline is sarcastic or not. To predict this we used bidirectional LSTM models, evaluating four of them. Two models were fed data where the entire dataset was tokenised at once, whereas the other two were fed data tokenised after splitting into train and test sets. The idea is to hold the validation data out altogether and only then provide it to the model for validation. This method worked out well and the observed model accuracy is also good.

In this project we use the GloVe embedding; in this way we apply transfer learning and reduce the number of weights and biases that must be learnt, thus reducing resources and execution time while increasing the accuracy of the model. Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. It is a popular approach in deep learning, where pre-trained models are used as the starting point for natural language processing tasks, given the vast compute and time resources required to develop neural network models for these problems and the large jumps in skill they provide on related problems. Compared to word2vec, GloVe allows for a parallel implementation, which makes it easier to train over more data. GloVe is believed to combine the benefits of the word2vec skip-gram model on word analogy tasks with those of matrix factorization methods that exploit global statistical information.

We evaluated four models. Among them, model #1, which has multiple bidirectional LSTM layers and was trained on data tokenised after the split, worked well; its accuracy scores are mentioned in the cells above. Comparing train and validation accuracy, the model looks like a good fit with very little overfitting. We used the GloVe embedding for the weights, which greatly increased accuracy and also reduced training time. Model #1 and model #2 seem to perform similarly, but a judgement between the two can only be made after training on more data. Model #1's training time is a bit longer than model #2's for the given number of records, so it would be preferable to run these two models on a larger dataset before choosing a model for production use.